[WIP] Stable Diffusion 3.x and Flux Optimization #22986

Draft · tianleiwu wants to merge 21 commits into main

Conversation

@tianleiwu (Contributor) commented on Dec 2, 2024

Description

This work is in progress.

Optimize the ONNX pipeline for Stable Diffusion 3.x and Flux 1.0 models (fp32 or fp16).

  • Update optimize_pipeline script
  • Update benchmark script
  • Update documentation for Stable Diffusion 3.x and Flux 1.0 models
  • Add graph optimizations for the MMDiT model (see the sketch after this list):
    • FastGelu fusion
    • RMSNorm fusion
    • MultiHeadAttention fusion
  • Add graph optimizations for Flux transformer models:
    • MultiHeadAttention fusion
  • Float16 conversion supports bfloat16 for blocked nodes.
  • Update graph optimizations for T5
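
The fusions and the float16 conversion above are what optimize_pipeline.py drives under the hood. Below is a minimal sketch of the same flow using onnxruntime.transformers directly; the model_type value "mmdit" is an assumption (MMDiT/Flux support is what this PR adds), and the block-list and I/O-type arguments are illustrative only.

```python
# Minimal sketch, not the PR's exact code. "mmdit" as a model_type is an assumption;
# released packages expose types such as "unet".
from onnxruntime.transformers.fusion_options import FusionOptions
from onnxruntime.transformers.optimizer import optimize_model

model_type = "mmdit"  # hypothetical value, see note above
options = FusionOptions(model_type)

model = optimize_model(
    "flux1_schnell_onnx/fp32/transformer/model.onnx",
    model_type=model_type,
    opt_level=0,  # rely on the Python fusion passes rather than ORT's offline graph level
    optimization_options=options,
)

# Float16 conversion; node_block_list keeps numerically sensitive nodes out of fp16
# (this PR additionally allows blocked nodes to use bfloat16, not shown here).
model.convert_float_to_float16(keep_io_types=True, node_block_list=[])
model.save_model_to_file(
    "flux1_schnell_onnx/fp16/transformer/model.onnx", use_external_data_format=True
)
```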

Example: optimize the ONNX pipeline for Flux 1.0 Schnell and convert it to float16:

python optimize_pipeline.py -i ./flux1_schnell_onnx/fp32 -o ./flux1_schnell_onnx/fp16 --float16

  Optimize flux1_schnell_onnx/fp32/transformer/model.onnx ...
  Fused LayerNormalization: 115
  Fused SimplifiedLayerNormalization: 152
  Fused FastGelu: 76
  Fused MultiHeadAttention: 57
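
As a small cross-check (not part of the PR), the fused contrib ops can be counted directly in the optimized graph and compared against the "Fused ..." summary above:

```python
# Count fused ops in the optimized transformer graph.
from collections import Counter

import onnx

model = onnx.load("flux1_schnell_onnx/fp16/transformer/model.onnx")
counts = Counter(node.op_type for node in model.graph.node)
for op_type in ("LayerNormalization", "SimplifiedLayerNormalization", "FastGelu", "MultiHeadAttention"):
    print(f"{op_type}: {counts.get(op_type, 0)}")
```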

H100 Benchmark Results

  • GPU: NVIDIA H100 80GB HBM3
  • Image Size: 1024x1024
  • Batch Size: 1
| Model | Steps | Precision | Engine | Latency (Seconds) | GPU Memory (MB) |
|-------|-------|-----------|--------|-------------------|-----------------|
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (compile) | 8.198 | 37,603 |
| Flux 1.0 Dev | 50 | FP16+BF16 | Optimum (ORT) | 10.762 | 41,469 |
| Flux 1.0 Dev | 50 | FP16+FP32 | Optimum (ORT) | 10.891 | 43,545 |
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (eager) | 12.339 | 36,651 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (compile) | 0.775 | 37,857 |
| Flux 1.0 Schnell | 4 | FP16+BF16 | Optimum (ORT) | 0.931 | 41,433 |
| Flux 1.0 Schnell | 4 | FP16+FP32 | Optimum (ORT) | 0.939 | 43,809 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (eager) | 1.120 | 36,629 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (compile) | 7.466 | 32,217 |
| SD 3.5 Large | 50 | FP16+BF16 | Optimum (ORT) | 10.275 | 36,609 |
| SD 3.5 Large | 50 | FP16+FP32 | Optimum (ORT) | 10.283 | 36,729 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (eager) | 11.615 | 31,517 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (compile) | 3.240 | 21,143 |
| SD 3.5 Medium | 50 | FP16+BF16 | Optimum (ORT) | 4.799 | 25,097 |
| SD 3.5 Medium | 50 | FP16+FP32 | Optimum (ORT) | 4.838 | 25,109 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (eager) | 5.582 | 20,489 |

A100 Benchmark Results

  • GPU: A100-SXM4-80GB
  • Image Size: 1024x1024
  • Batch Size: 1
| Model | Steps | Precision | Engine | Latency (Seconds) | GPU Memory (MB) |
|-------|-------|-----------|--------|-------------------|-----------------|
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (compile) | 17.593 | 37,723 |
| Flux 1.0 Dev | 50 | FP16+BF16 | Optimum (ORT) | 21.918 | 41,348 |
| Flux 1.0 Dev | 50 | FP16+FP32 | Optimum (ORT) | 22.060 | 44,860 |
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (eager) | 24.267 | 36,847 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (compile) | 1.627 | 37,881 |
| Flux 1.0 Schnell | 4 | FP16+BF16 | Optimum (ORT) | 1.884 | 41,537 |
| Flux 1.0 Schnell | 4 | FP16+FP32 | Optimum (ORT) | 1.902 | 44,858 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (eager) | 2.162 | 36,831 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (compile) | 15.881 | 32,307 |
| SD 3.5 Large | 50 | FP16+FP32 | Optimum (ORT) | 19.837 | 36,451 |
| SD 3.5 Large | 50 | FP16+BF16 | Optimum (ORT) | 19.964 | 36,461 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (eager) | 22.477 | 31,513 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (compile) | 6.476 | 21,341 |
| SD 3.5 Medium | 50 | FP16+FP32 | Optimum (ORT) | 8.775 | 25,183 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (eager) | 10.057 | 20,433 |
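
The Torch baselines in the tables above compare torch.compile with eager mode. A rough sketch of how such a baseline can be timed with diffusers is shown below; it is not the PR's benchmark script, and the model id, prompt, and warmup handling are illustrative.

```python
# Illustrative timing sketch for the torch.compile Flux 1.0 Schnell baseline.
# Assumes diffusers with Flux support and a CUDA GPU.
import time

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer = torch.compile(pipe.transformer)

def run():
    return pipe(
        "a photo of an astronaut riding a horse",
        height=1024, width=1024, num_inference_steps=4, guidance_scale=0.0,
    )

run()  # warmup; also triggers compilation
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
run()
torch.cuda.synchronize()
print(f"latency: {time.perf_counter() - start:.3f} s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / (1024 * 1024):.0f} MB")
```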

Future Work

  • Triton kernel for matrix multiplication and auto-tuning.
  • FP8/Int8 quantization.

Motivation and Context

SD 3.5 Architecture:
https://huggingface.co/stabilityai/stable-diffusion-3.5-medium/resolve/main/mmdit-x.png

@tianleiwu marked this pull request as draft on December 3, 2024, 19:19.
@@ -358,3 +361,122 @@
        self.nodes_to_add.append(fused_node)
        self.node_name_to_graph_name[fused_node.name] = self.this_graph_name
        return True

    def fuse_4(self, tanh_node, input_name_to_nodes: Dict, output_name_to_node: Dict) -> Optional[bool]:

Code scanning / CodeQL notice: Explicit returns mixed with implicit (fall-through) returns. Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
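
A minimal illustration of what the notice points at is given below; the helper name is hypothetical and this is not the PR's actual fuse_4 body. When a method is annotated Optional[bool], ending the no-match path with an explicit return makes the intent clear.

```python
# Hypothetical illustration of the CodeQL note, not the PR's fuse_4 logic:
# every code path returns explicitly, so nothing falls through to an implicit None.
def fuse_4(self, tanh_node, input_name_to_nodes, output_name_to_node):
    fused_node = self.try_match_fast_gelu(tanh_node, output_name_to_node)  # hypothetical helper
    if fused_node is None:
        return None  # explicit: the pattern did not match
    self.nodes_to_add.append(fused_node)
    self.node_name_to_graph_name[fused_node.name] = self.this_graph_name
    return True
```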
@github-actions bot commented: You can commit the suggested changes from lintrunner.

Comment on lines +91 to +96
# if (options is None) or options.enable_skip_layer_norm:
# self.fuse_skip_simplified_layer_norm()
# self.fuse_skip_layer_norm()
# if (options is None) or options.enable_bias_skip_layer_norm:
# # Fuse SkipLayerNormalization and Add Bias before it.
# self.fuse_add_bias_skip_layer_norm()

Code scanning / CodeQL notice: Commented-out code. This comment appears to contain commented-out code.